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Abstract 

We discuss the conditions under which Scan Statistics can be fruitfully implemented 
to signal a departure from the underlying probability model that describes the 
experimental data. It is shown that local perturbations ("bumps" or "excesses" of 
events) are better dealt within this framework and, in general, tests based on these 
statistics provide a powerful and unbiased alternative to the traditional techniques 
related with the \ 2 an d Kolmogorov distributions. Approximate formulas for the 
computation of Scan Statistics in the range of interest for high energy and nuclear 
physics applications are also presented. 
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1 Introduction 



In the last two decades, the properties of "scan statistics" [1,2] have been 
extensively investigated and this subject still represents a rapidly developing 
branch of applied probability. Nowadays, scan statistics are used in several 
areas of science and engineering to analyze the occurrence of cluster of events 
and to assess their statistical significance. Applications range from control 
theory to molecular biology while, at present, their use in physics is mainly 
limited to the analysis of time series, especially in x and 7-ray astronomy [3]. 
In fact, a common problem in nuclear and particle physics is to determine 
whether an observed cluster of events has occurred by chance or if it signals a 
departure from the underlying probability model ("null hypothesis") for the 
data. Peaks or local excesses of events can appear during energy scans at 
the colliders, in the distributions of kinematical variables as invariant masses 
or four-momentum transfer, in the interpretation of Dalitz plots, etc. The 
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traditional test statistics that are employed to challenge the null hypothesis 
after a data taking can be divided into two broad families. Binned goodness- 
of-fit tests are connected to the x 2 distribution. If the region where the excess 
is expected is known a priori because it has already been observed by another 
experiment, i.e. a confirmatory experiment against a claim of discovery has 
been performed, the level of agreement between the null hypothesis and the 
data can be evaluated in a straightforward manner through a Pearson \ 2 
test [4]. For one-dimensional distributions of the variable x (x G [A, £>]), let 
the number of observations in the candidate signal region [a, b] be k ab . The 
best estimate of the background in the region [a, b] is given by 



B 



ab 



= J B(x,e)dx (i) 



where 9 is the set of parameters describing the background distribution and 
9 is its best estimate based on events outside the interval [a, b] , so that 
cov(k ab , B ab ) = 0. Hence, the test variable can be defined as: 
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V(k ab —B ab ) being the variance of k ab —B ab . Under the null hypothesis V{k ab ) = 
B ab and V(k a b — B ab ) = B ab + &% b , i.e. the variance is the quadratic sum of 
the estimated background rate in the signal region and its error. Finally, if the 
error on the background is negligible 
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and, in the asymptotic limit, the test variable behaves as x 2 (l)- Clearly, if the 
bin number iV bin and the bin size w = (B — -4)/iV bin are specified in advance 
but no information on the position, size and width of the signal are available, 
the corresponding test is given by 
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which behaves as x 2 (^0 m the asymptotic limit. The power 1 of this test 
depends on the position and width of the signal compared to the binning 
since a cluster shared among several bins becomes harder to be detected while 
clusters appearing in too large bins are swamped by background. If the size 
and position of the bin is chosen after the inspection of the data, the power 
of the test is increased but the estimate of the p- value 2 of the null hypothesis 
inferred from x 2 {N) is unreliable; hence, the significance level of the hypothesis 
test is biased. The binning problem can be overcome employing a second 
family of test statistics connected to the Kolmogorov distribution. The most 
common test is the Kolmogorov- Smirnov test [4] which corresponds to the 
largest distance between the cumulative distribution of the data and the one of 
the null hypothesis. This distance has a characteristic distribution that can be 
computed analytically and, hence, provides the p-value of the null hypothesis 
for a given data taking. These tests are well-suited to detect global distortions 
of the x distribution but have limited power for strong local deviation from the 
null hypothesis (cluster of events). Moreover, the power depends significantly 
on the position of the signal peak. 

The procedure used to seek for event clusters after the data taking suggests 
a possible alternative to those methods, but which does not suffer from the 
problems of choosing the region a posteriori. The search is performed scanning 
the [A, B] interval to identify the region where an anomalous accumulation of 
events appears. Given N events distributed along the [A, B] range, we call 
S(w) the largest number of events in a window of fixed length w. If the dis- 
tribution of S(w) is known, it will be possible to compute the probability 
Prob(S(w) > k) for the null hypothesis to produce a cluster S(w) greater 
or equal than the one actually observed. Hence, the p-value of the null hy- 
pothesis can be assessed. In this context, an a priori binning similar to the 
one of the Pearson \ 2 test is no more needed. Moreover, the test statistics 
S(w) ("scan statistics") is not affected by the drawbacks of the Kolmogorov- 



The power of an hypothesis test against a specific alternative hypothesis is the 
chance that the test correctly rejects the null hypothesis when that alternative 
hypothesis is true; that is, the power is 100% minus the chance of a Type II error 
when that alternative hypothesis is true. 

2 For a null hypothesis Hq and a test statistic T (e.g. the one of Eq. (4) ) we define 
g{T\Ho) as the p.d.f. of T in the occurrence of Hq. The p-value is defined as the 
probability to find T in the region of equal or less compatibility with H than the 
level of compatibility observed with the actual data [5] . For example, if T is defined 
such that large values correspond to poor agreement with the hypothesis, then the 
p-value will be 



T b s being the value of the test statistic obtained in the actual experiment. 
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Smirnov (K-S) tests (see Sec. 3). Sometimes this approach is followed, at least 
qualitatively, in literature. For instance, the OPAL [6] data accumulated at 
LEP during the high energy run beyond the Z° resonance (LEP2) were used 
to falsify the ALEPH [7] claim of a peak in the dijet invariant mass M of 
the e + e~ — > four jets final state at M ~ 105 GeV. Clearly, this refutation 
was carried out using the test statistics of Eq. (3). Moreover, to test for a 
peak in the dijet mass sum distribution for arbitrary mass M and indepen- 
dent of histogram binning, the positions of the mass windows were scanned 
over the full range of M. However, no quantitative statement was drawn due 
to the strong correlations of the contents of nearby bins. In fact, accounting 
for this correlation is possible once the properties of the scan statistics S(w) 
are determined. In Sec. 2 these properties are discussed and the formulas to 
compute Prob(S(w) > k) are presented. The power and significance of the 
test statistics S(w) is computed in Sec. 3 and compared with the Pearson x 2 
and K-S approach for one-dimensional distributions. Extensions of the tests 
based on S(w) and further applications in particle physics data analyses are 
discussed in Sec. 4. 



2 Scan statistics 



Let us consider an interval [A, B] of a continuous variable x and a Poisson 
process ("background") whose mean value per unit interval is denoted with 
A. Hence, the probability of finding Y x {w) events in an interval [x,x + w] is 

Prob(Y x (w) = k) = e -^^l! ; k = 0,1,2,... (5) 



The number of events in any disjoint non-overlapping intervals are indepen- 
dently distributed. We call "scan statistic" (SS) the largest number of events 
to be found in any subinterval of [A, B] of length w 3 , i.e. 

S(w) = max {Y x (w)} (6) 



The probability that the number of events in a scanning window never reaches 
k will be denoted, following [1], as 

Q*(k, A A, w/A) = 1 - Prob(S(w) > k) (7) 

3 The case of non-uniform background can be dealt with by allowing for a window 
of variable width w(x) that always contains w/(B — A) percent of the expected 
events under the null hypothesis [8]. 



4 



where A = B — A and the suffix "*" indicates that unconditional probabilities 
are considered, i.e. that the overall number of events N in the interval [A, B] 
is not fixed but it fluctuates according to Eq. (5) with w — A. The exact 
form of Eq. (7) can be expressed in terms of the sum of products of two 
determinants [9]. The summation is over the set V of all the partitions of N 
into 2H+1 non-negative integers m; satisfying mj+mj + i < k for % — 1, . . . , 2H, 
where H is the largest integer in A/w. The determinants are computed starting 
from the (H + 1) x (H + 1) matrix {h}ij and the H x H matrix {v}ij whose 
entries are: 



2i-l 

= ^2 m s - (i - j)k 1 < j < i < H + 1 

s=2j-l 
2j-2 

= - J2m s + (j - i)k l<i<j <H + 1 

s=2i 
2i 

J2m s -(i-j)k l<j<i<H 

s=2j 
2j-l 

- ^2m a + (j-i)k l<i<j<H 

s=2i+l 



Vij 



Using these definitions for V, and v^, we have for k > 2 and w < A: 
Q*(k,XA,w/A) =J2 R * det|l/^!| det|l/^!| 



v 



In formula (8) 



in 

R* = M d M (-=- - d) N - M p{N, AA) (9) 



H 



M = J2m 2j+ i (10) 

3=0 



being d = 1 — wH/A and p(N, AA) is the Poisson probability of having iV 
events from an average rate A A. 

A very useful approximation of Eq. (8) has been derived by Naus in 1982 [10], 
based on the exact values of the probabilities Q2 = Q*(k,2ip, 1/2) and Q 3 = 
Q*(k, Sip, 1/3)) 4 . It can be shown that 

Q*(k^L,l/L)^Q* 2 [Ql/Q* 2 } L - 2 (11) 



For later convenience we define ip = Xw and L = A/w. 
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where 



Q* 2 = [F(k - 1, n 2 - (k - 1) P(k, *p)p(k - 2, V) 

— (A: — 1 — -0) p(k, ip)F(k — 3, ip) (12) 

Q* = [F(k - 1, 0)] 3 - A 1 + A 2 + A 3 - A 4 (13) 



and 



A = 2 p(k, i/j)F(k - 1,0) {(k - 1) F(k - 2, -0) - -3,-0)} 
A 2 = 0.5 [p(fc, ij)] 2 {(k-l)(k- 2) F(A; - 3, 0) 
-2 (A; - 2) 0F(&; - 4, -0) + ^ 2 F(A; - 5, ip)} 



t2 



A 3 = ^p(2^^) [F(r-1,0)]' 

r=l 
fc-l 

A A = P(^k - r, 0)p(r, 0) {(r - 1) F(r - 2, -0) - 0F(r -3,-0)} 



r=2 



In the above formulas F(k,ip) denotes the cumulative distribution 



F(M) = I>M) ; p(<,^) = e^ (14) 
i=o % - 



and F(k,ip) = for /c < 0. For large values of A/iu an even simpler approxi- 
mation due to Aim [11] can be implemented: 



Q*(k, XA,w/A) ~ 

F(k - 1, \w) exp l- k ~™ X \(A -w)p(k-l, \w) j (15) 



Eq. (15) is often used in astrophysics applications and in many time series 
problems but it is of limited use in the present case where the condition w / A <C 
1 is rarely fulfilled. In the following, the test statistics based on SS will be 
studied relying on the approximation (11). For a systematic comparison of 
the various approximations of Eq. (8) we refer to [1]. 
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3 Power and significance for one-dimensional distributions 

Sec. 2 dealt with the distribution of the scan statistics under the null hypoth- 
esis. The class of alternative hypotheses considered hereafter describe a local 
perturbation of the uniform distribution which leads to the appearance of a 
"excess" of events. Long-range distortions like anomalous angular distribu- 
tions are better dealt with global K-S tests and will not be further considered 
here. The alternative functions are Poisson processes of mean S. The signal 
events are spread along [A, B] according to a normal distribution of mean xg 
and sigma ag 5 . In the following we reject the null hypothesis if its p-value is 
smaller than 5%. The actual significance of the test statistics, i.e. the number 
of experiments where the null hypothesis was rejected albeit true, has been 
computed by Monte Carlo experimentation. Similarly the rate of Type II er- 
rors was computed to estimate the power of the test. In general, some prior 
assumptions are made before the inspection of a distribution. The domain 
[A, B] of the variable x accessible to the experiment depends on the particular 
apparatus and, in most of the cases, it is selected a priori; so it is not expected 
to be a source of biases. In fact, in many applications the final distribution 
of x is the result of sequential cuts on other kinematic variables which can 
severely bias the sample [12]. For narrow resonances, whose width does not 
exceed the (known) instrumental resolution of the detector, the scanning win- 
dow w of S(w) can be fixed a priori. In particular, for a gaussian perturbation 
of variance a 2 s a nearly optimal choice of w is w ~ 4ag [13]. 

Fig. 1 shows the power of the K-S, SS and \ 2 tests as a function of the 
signal position xg. Here, [A,B] = [0,1], a s = 0.05, B = AA = 100 and 
S = 20. The optimal bin size for the y 2 test has been computed following the 
prescription [14] A^" bin = 2(AA) 2//5 , where AA is the expected sample size in 
case of null hypothesis 6 . In Fig. 1 signal events generated beyond the interval 
[0,1] are ignored (out of the sensitivity region [A, B] ). The power averaged 
over the peak positions is shown in Fig. 2 as a function of S. A few comments 
are in order. As anticipated in Sec. 1 the K-S test is not appropriate for 
local perturbations. The power is limited compared to other statistics and 
depends on the peak position, having the highest sensitivity at the border of 
the distribution. The Pearson x 2 test has a much higher power but in general 
the peak detection efficiency is reduced when the peak is shared between 

5 This is the case, for instance, of a narrow resonance whose intrinsic width is 
smaller than the detector resolution. For resonances broader than the instrumental 
precision a relativistic Breit-Wigner or a Jacobian-peak would be more appropriate. 
However, for the present purposes the details of the alternative function are not 
critical. 

6 Other choices of the binning for the \ 2 test, based on the knowledge of ag, have 
been tested by Monte Carlo experimentation. The corresponding powers do not 
exceed the one shown in Fig. 1. 
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Fig. 1. The power of the test statistics versus the peak positions for S = 20 and 
B = AA = 100. 

two adjacent bins. On average the x 2 test underperforms w.r.t. SS since the 
correlations among the bins are ignored 7 . However, the bin prescription for 
X 2 is independent of the a priori knowledge of as while SS makes use of this 
additional information. This is a drawback for SS if the cluster width is broader 
than the instrumental resolution because the scanning window is no more 
optimized. Fig. 3 shows the average power versus as assuming w = A/Nu n 
and S = 20. The vertical line corresponds to w — Aa s . In fact, it is possible 
to compute the scan statistics for an a posteriori choice of w [15] but, clearly, 
this additional degree of freedom implies a deterioration of the power. 

On the other hand, SS has a relevant feature which is not manifest in Figs. 1,2. 
The test statistics (4) behaves as x 2 (N) only in the asymptotic limit. This im- 
plies that the prescription A^" bin = 2(AA) 2 / 5 is appropriate only if the number 
of expected events per bin is such that the normal limit is justified. If this is 
not the case, the extraction of the p-value for the null hypothesis under the 
assumption T ~ X 2 {N) is biased and the proper behavior has to be restored 
increasing the bin size or computing the correct p-values by MC experimen- 
tation [16]. This fact is immaterial for SS, since the derivation of Eqs. (8) 
and (11) does not invoke the Central Limit theorem. The unbiaseness of the 
p-value for the null hypothesis even for few events expected in the scanning 



This is the reason why the x 2 test an d the Run Test [4] are complementary. 
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Fig. 2. The power of the test statistics averaged over the peak positions versus the 
mean expected signal S. 



window has been checked by Monte Carlo experimentation. Fig. 4 shows the 
probability Prob(S(w) > k) — Prob(S(w) > k + 1) of finding exactly k events 
after a scan, computed by Monte Carlo (crosses) and by Eq. (11). The up- 
per plot shows the region with highest probability assuming B — 10, S — 0, 
w = 0.2; in this case the corresponding y 2 test with optimal binning would 
have no more than 2 events per bin. The number of trials is 10 7 so the MC 
error in the upper plot is negligible. The lower plot indicates the tail of the 
distribution. Note that the exact formula on which the Naus approximation 
is based holds for k > 1. Biases in the p-value will appear only when the 
approximation 



Prob(S(w) = 0) ~ =>• Prob(S(w) = 1) ~ 1 - Prob(S(w) > 2) (16) 



does not hold, that is when the probability of having zero events after a full 
scan is non-negligible as in Fig. 5 where AA = 2 and the first empty dot 
indicates Prob(S(w) = or S(w) = 1) = 1 - Prob(S(w) > 2). 
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Fig. 3. The power of the test statistics averaged over the peak positions versus 
the signal width as for B = 100 and S = 20. For the Pearson \ 2 test; optimal 
binning N^ n is assumed; for SS w = A/iV bin = 0.08. The vertical line corresponds 
to w = Aas- 



4 Extensions of the Scan Statistics 



In Sec. 3 the alternative hypotheses (normal distributions centered at x s G 
[A, B\) were such that events generated outside the sensitivity region [A,B] 
have been ignored. This implies a loss of power for the SS in the case of x,s 
lying near the border of the sensitivity region (see Fig. 1), i.e. when x,s — A or 
B — xs is comparable or smaller than the scanning window w. Clearly, these 
results can be extended in a straightforward manner to a bounded variable x 
on a range [^4, B] where the signal accumulates at the border. In this case the 
probability of finding a signal event between x and x + dx is 



Prob(x G [x, x + dx]) 



j(-oo^) fi( x _ A) 

G(x,x s ,a s ) 

j(B,oc) _ ^ 



x = A 



x = B 



x g (A, B) 



(17) 
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Fig. 4. The probability of seeing k events after a scan with B = 10, w = 0.2 under 
the null hypothesis, computed by MC (crosses) and Eq. (11) (empty dots). The 
high-A: tail of the distribution is shown in the lower plot. 

and zero for x outside [^4, £>]; G(x, xs, o"s) is the normal distribution with mean 
x s and variance <r| and 



u 

&® = J G(x,x s ,a s )dx 



(18) 



On the other hand, in many physics applications the alternative hypothe- 
sis (17) does not describe our signal expectation. For instance, if the bounded 
variable is connected with an angular distribution, a signal excess will mani- 
fest as a local perturbation of an uniform distribution of events along a unit 
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1 2 3 4 5 

k 

Fig. 5. The probability of seeing k events after a scan with B = 2, w = 0.2 under 
the null hypothesis, computed by MC (crosses) and Eq. (11) (empty dots). The 
first empty dot indicates the probability of having k < 2 events after a full scan: 
l-Prob(S(w) =k>2). 

circle. Event positions are described by the bounded variable 9 e [0, 1]. In this 
case S c (w) is denned as the maximum number of points in any arc of length 
w and, following the notation of Sec. 2 with A = 1, we have [10]: 

Prob(S c (w) > k) c l-g-(fe,4^,l/4) [^^^^]^ (19) 

[Q*(k,2ijj, 1/2)] 

where Q*(k, 3^, 1/3) and Q*(k,2ijj, 1/2) are given by Eqs. (12) and (13) and 
Q*(k,4ip, 1/4) can be derived from Eq. (8) with L = 4. 

Note also that in Sec. 3 we considered the parameters describing the back- 
ground known with high precision, so that it is possible to assume the null 
hypothesis to be fully specified. This is not the case if the parameters 9 of 
Eq. (1) have to be estimated after the data taking. In this case SS should be 
extended to devise the optimal estimate of the underlying background density 
B(x, 9_) that is unbiased and consistent under both the null hypothesis and the 
occurrence of a local excess of width as- This problem is still unsolved [1] for 
a generic function B(x,9_). Unbiased estimators have been obtained for simple 
functional dependences as in the case of the linear regression: for a discussion 
we refer to [1]. 
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Finally, it is worth mentioning that, even if we focused on 1-dim distributions, 
SS has been extended to multivariate problems [17,18,19]. In particular, 2-dim 
applications are quite common e.g. in space analysis of arrival direction data of 
high energy cosmic rays, x and 7-ray bursts [3]. A description of multivariate 
unconditional scan statistics can be found in [1,18]. 



5 Conclusions 

In this paper, we considered the conditions under which Scan Statistics can be 
implemented to signal a departure from the underlying probability model that 
describes the experimental data. In fact, local perturbations ("bumps" or "ex- 
cesses" of events) are better dealt within this framework and, in general, tests 
based on S(w) provide a powerful and unbiased alternative to the traditional 
techniques related with the x 2 an d Kolmogorov distributions. This holds in 
particular if the widths of the resonances are known a priori, e.g. when the 
event distribution is dominated by the instrumental resolution. Approximate 
formulas for the computation of SS in the range of interest for high energy 
and nuclear physics applications have been provided. Possible extensions to 
bounded variables and multivariate problems were also discussed. 
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